ABSTRACT

A survey of recent literature was undertaken to
locate validity studies of paper-and-pencil tests which met the
following criteria: (1) Studies were conducted in a business or
industrial (i.e. non-educational, non-military) setting; (2) Separate
statistics were available for blacks and whites; (3) Race was not
confounded with some outside variable which would preclude meaningful 
interpretation; (4) Necessary data were reported to enable a test of 
homogeneity of regression between racial groups. For each of 20 
studies which met these criteria, a homogeneity of regression 
analysis was conducted on each predictor-criterion pair to determine 
if there were significant differences between blacks and whites in 
standard errors, slopes, or intercepts of the regression lines. The 
number of significant differences in standard errors and in slopes 
was less than would be expected by chance, indicating that tests do 
not have differential validity between white and black groups. For 
intercepts, significant differences in excess of chance were 
obtained. The direction of the differences was such that job 
performance of blacks was overestimated by tests. (Author) 

A RE-ANALYSIS OF PUBLISHED DIFFERENTIAL VALIDITY STUDIES*

William W. Ruch 
Psychological Services, Inc. 

Los Angeles, California 

A survey of the recent literature was undertaken to locate validity 
studies of paper-and-pencil tests which met the following criteria: 

1. Studies were conducted in a business or industrial
(i.e. non-educational, non-military) setting;

2. Separate statistics were available for blacks and whites;

3. Race was not confounded with some outside variable
which would preclude meaningful interpretation;

4. Necessary data were reported to enable a test of 
homogeneity of regression between racial groups. 

Studies which met some but not all of the above-stated criteria 
were excluded. Baehr's (1) study of Chicago policemen was excluded 
since the criteria were confounded with race due to the fact that, 
in the words of the author, "... Negro patrolmen tend more often to
be assigned to predominately Negro districts, which would include the
ghetto areas. Since the crime rate in these districts is usually
higher, it is natural that more arrests will be made. It also
seems likely that patrolmen who are constantly 'where the action is'
will have more complaints made against them and, possibly, have more of
them sustained." This confounding is evidenced by the fact that in her
Wave I sample, the average Negro officer made 81 arrests, versus 40 for
whites; the average Negro officer had 2.6 disciplinary actions versus
.8 for whites. If we view making arrests as good, and being subjected
to disciplinary action as bad, it is obviously impossible for a test
which is fair on one criterion to be fair on the other also.

* Presentation in the symposium, "Differential Validation under EEOC
and OFCC Testing and Selection Regulations," at the annual meetings
of the American Psychological Association, Honolulu, Hawaii,
September 6, 1972.

Also excluded on the basis of confounding was Kirkpatrick's (10)
study on Nursing Students in which race was 100% confounded with
institution. Another reason for omitting this study was that it is
in an educational, rather than in an industrial, setting. Kirkpatrick's
(10) Study 5 on Clerical Insurance Employees was also omitted on the
basis of a near 100% confounding of race and institution. Because criteria
were standardized within institution, there was no way to test for
the significance of differences between standard errors, slopes or
intercepts, all of which require between-group comparisons of
criterion means and/or standard deviations.

Since only continuous variables will fit the model employed, all 
dichotomous turnover criteria were deleted. This resulted in omitting 
the study by Ruda and Albright (16) in which turnover was the sole 
criterion and deleting the turnover criterion from several of the 
Farr and O'Leary studies (3). Also deleted from several of the 
Farr and O'Leary (3) studies were the "extension of probation" 
and the "promotion" criteria. While a study by Lefkowitz (11) 
involved a continuous turnover criterion based upon the number of days
employees remained with the company, with a maximum of 60, this study
was omitted due to the markedly truncated and U-shaped nature of the 
distribution. 

Studies by Lopez (4), Mitchell and Albright (14), and Wollowick (18) 
were omitted because all of the data necessary for the present analysis 
were not reported. 

All of the military studies were omitted somewhat arbitrarily.
These include 11 studies by Gordon (6), 10 by Guinn, Tupes and Alley (8),
and 1 by Farr et al. (4). As will be seen, these studies and the
ones in the present report yield essentially the same results.

PROCEDURE 

In most of the studies there were several predictors and several 
criteria. In order to look for patterns, each combination of predictor 
and criterion was treated separately. That is, if a study had 6 
predictors and 10 criteria, 60 analyses were made for that study. 

It must be pointed out that the criteria are highly intercorrelated - 
not only for the usual reasons of halo, but also because they included 
many part-whole and corrected-uncorrected situations. The predictors
are also highly intercorrelated, sometimes involving two scoring 
formulas and two or three time limits for the same test. Thus, elements
in the test by criterion matrix certainly cannot be treated as
independent.

The method of analysis used was that of significance tests
of homogeneity of regression between whites and blacks as formulated 
by Gulliksen and Wilks (9). Since so few studies involved other 
minorities, these subjects were not included in the present analysis. 
Therefore, all conclusions must be limited to white versus black 
differences.

Within each study and for each predictor-criterion combination,
three significance tests were run. First, the significance of the 
difference between standard errors of whites and of blacks was 
assessed. If this proved to be significant, the second two tests 
were not run. If it was not significant, the significance of 
differences between slopes of regression lines was assessed. If 
this was significant, the final test was not run. If it was non- 
significant, the significance of differences of intercepts was 
assessed. The 5% level of significance was used throughout. 
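
The stopping rule just described can be sketched in a few lines of Python. This is only an illustration, not the program used for the analysis: the function name `sequential_outcome` and its arguments are assumptions, and the three p-values are taken as given rather than computed from the Gulliksen and Wilks statistics.

```python
from typing import Optional


def sequential_outcome(p_se: float,
                       p_slope: Optional[float] = None,
                       p_intercept: Optional[float] = None,
                       alpha: float = 0.05) -> str:
    """Return the first significant test for one predictor-criterion pair.

    Hypothetical helper: the p-values are assumed to have been computed
    elsewhere. A later test is consulted only when every earlier test
    is non-significant, as in the procedure described in the text.
    """
    if p_se < alpha:
        return "SE"            # slope and intercept tests are not run
    if p_slope is None:
        raise ValueError("slope test is required when SE is not significant")
    if p_slope < alpha:
        return "slope"         # intercept test is not run
    if p_intercept is None:
        raise ValueError("intercept test is required when slope is not significant")
    if p_intercept < alpha:
        return "intercept"
    return "none"


# Made-up p-values for three predictor-criterion pairs.
print(sequential_outcome(0.01))              # SE
print(sequential_outcome(0.40, 0.03))        # slope
print(sequential_outcome(0.40, 0.30, 0.20))  # none
```

One consequence of this rule is that the number of slope tests run can never exceed the number of non-significant standard error tests, and likewise for intercepts.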

If the standard errors were significantly different, the 
difference was indicated as negative if the standard error was 
smaller for whites and positive if it was greater. A significant 
difference between slopes was recorded as negative if the slope 
was smaller for blacks and positive if the slope was greater for
blacks. If there was a significant difference between intercepts, 
the black test mean was plugged into the white regression equation 
to predict the black criterion mean and the actual black criterion 
mean was subtracted from the result. Thus, a plus means that the
test overpredicts for blacks - that is, the test is unfair to whites -
and a minus means that the test underpredicts for blacks - that is, it
is unfair to blacks. Popular hypotheses which have been advanced
are that tests are less valid for blacks and/or they are unfair
to blacks. These hypotheses are symbolized in this analysis with
a minus for standard error, a minus for slope and a minus for intercept.
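
The sign convention for intercepts can be illustrated with a short Python sketch. It assumes the white regression line is rebuilt from summary statistics (means, standard deviations and the validity coefficient), which is one common way to do it and not necessarily how each original study was processed. The function names and all numbers below are made up for illustration.

```python
def white_regression(mean_x, sd_x, mean_y, sd_y, r):
    """Slope and intercept of the criterion-on-test line from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def intercept_sign(white, black_test_mean, black_criterion_mean):
    """Sign of (predicted black criterion mean - actual black criterion mean)."""
    slope, intercept = white_regression(**white)
    predicted = intercept + slope * black_test_mean
    difference = predicted - black_criterion_mean
    if difference > 0:
        return "+"   # test overpredicts for blacks (unfair to whites)
    if difference < 0:
        return "-"   # test underpredicts for blacks (unfair to blacks)
    return "0"


# Made-up summary statistics, for illustration only.
white = {"mean_x": 50.0, "sd_x": 10.0, "mean_y": 30.0, "sd_y": 5.0, "r": 0.40}
print(intercept_sign(white, black_test_mean=40.0, black_criterion_mean=27.0))  # +
print(intercept_sign(white, black_test_mean=40.0, black_criterion_mean=29.0))  # -
```

In the analysis itself the sign is recorded only when the intercept difference has first been found significant.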

RESULTS 

Results for each of the 20 studies included in this report are
attached. The first is Kirkpatrick's Study 1 on Female Clerical Workers
in an Insurance Company. Note that there are four predictors and five
criteria for a total of 20 validity regression equations. For each
of these 20, the test of significance of difference in standard
errors was conducted. Had any of these been significant, the letters
"SE" would have been placed in the appropriate element in the matrix.
No significant differences were found, so none is reported.

The next test was for significance of difference in slopes.
None was found and none is reported. The final test was for significance
of differences in intercepts. We see that all four tests showed
significantly different intercepts when predicting the merit rating
criterion. Since all of these are indicated as minus, we know that the
criterion was underpredicted for blacks. That is, the tests would be called
unfair to blacks. Out of 20 significance tests made for differences
in intercepts, four were found to be significant. Since this is greater
than the 5% which would be expected by chance, it is so indicated in
the summary table, with a minus which indicates the direction of
the difference. The zero for the standard error and the zero
for the slope indicate that there were fewer significant findings than
would be expected by chance. For each of the 20 studies included in 
this report, standard error, slope and intercept are each designated 
as whether they show significance in one direction, have fewer than 
chance significant findings, or are significant in the other direction. 

I would like to point out in this first study that although 
it is tabulated as a situation in which tests are unfair to blacks, 
there is perhaps a better reason to suspect the criterion. If a test 
is unfair - that is, has a cultural bias - we would expect it to be
unfair against many or all of the criteria. That is, we would 
expect to have significant findings in rows corresponding to the biased 
tests. However, in this instance, we have significant findings in 
a column, perhaps indicating a biased criterion in which blacks are 
rated spuriously high. Note that this is the only subjective criterion 
in Study 1. 

Let's look at one more table. Study 2 is Kirkpatrick's second
study of Female Clerical Employees in an Insurance Company. Out of
28 significance tests we find 16 significant differences in standard
errors, far more than would be expected by chance. Note that they
are all in the same direction, with the standard error being
greater for whites than for blacks. In every instance when more
than one significant finding was found in a study, all were in the same
direction. This is due to the high intercorrelation of tests and
intercorrelation of criteria, which, incidentally, make a joint
probability test of significance impossible. For the 12 situations
in which the standard error was not significant, the slope test was
run, yielding one significant finding. This is indicated in the
summary table as greater than chance, since one out of 12 is greater
than 5%. Of all 11 tests of intercepts run, 3 were significant,
also greater than chance. The study is scored as plus, minus, plus,
indicating that all three tests were in excess of chance with the
standard error being greater for whites, the slope being smaller for
blacks, and the tests being unfair to whites.
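
The scoring of a study as minus, zero or plus can be sketched as follows, using the counts quoted above for Studies 1 and 2. The helper `score` is a hypothetical name; the direction is passed in rather than computed because, within a study, all significant findings were in the same direction.

```python
def score(n_significant: int, n_tests: int, direction: str,
          chance_rate: float = 0.05) -> str:
    """Score one study on one test (SE, slope or intercept).

    Returns '0' when the share of significant findings does not exceed
    the chance rate, otherwise the supplied direction ('+' or '-').
    """
    if n_tests == 0:
        return "0"
    return direction if n_significant / n_tests > chance_rate else "0"


# Study 1: no significant SE or slope tests, 4 of 20 intercepts (minus).
study_1 = (score(0, 20, "-"), score(0, 20, "-"), score(4, 20, "-"))

# Study 2: 16 of 28 SE (plus), 1 of 12 slopes (minus), 3 of 11 intercepts (plus).
study_2 = (score(16, 28, "+"), score(1, 12, "-"), score(3, 11, "+"))

print(study_1)  # ('0', '0', '-')
print(study_2)  # ('+', '-', '+')
```

Note that a single significant finding is enough to exceed 5% whenever fewer than 20 tests were run, as with the one slope in 12 for Study 2.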

Under the null hypothesis that tests work exactly the same for 
blacks as they do for whites, in the long run we would expect to 
find exactly one-half of our studies showing statistical significance 
in excess of chance and exactly one-half of our studies showing statistical 
significance fewer times than would be predicted by chance. 

Thus, if chance alone were operating in the 20 studies, we would
expect 10 studies in which fewer significant findings occurred
than would be expected by chance, and 10 in which more significant
findings occurred than would be expected by chance. Of these 10
in excess of chance, we would expect 5 to be in the plus direction
and 5 to be in the minus direction. That is, for the 20 studies in
this report, if only chance were operating we would expect the distribution
of minus, zero and plus to be 5, 10 and 5.

The summary on page 33 gives these tabulations. We see that
for tests of standard error, we get 13 with fewer than chance significance,
2 with more than chance significance and the standard error smaller for
whites, and 5 with more than chance significance and the standard error
greater for whites. There is no evidence whatsoever that the standard
errors differ between the two groups. The summary of the tests of slopes
is also a chance distribution. There is no evidence of differential
validity. The tests for intercepts do show a significant pattern
as evidenced by a chi-square of 6.40, significant at the 5% level.
However, the criterion scores of blacks are overestimated by
tests, not underestimated. That is, interpreted in the same manner
for blacks as they are for whites, tests are unfair to whites.
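
The chi-square values can be checked with a few lines of Python. The sketch below assumes the ordinary goodness-of-fit statistic against the expected 5, 10, 5 split, with the obtained counts taken from the summary table; the critical value 5.991 is the standard 5% point for 2 degrees of freedom.

```python
EXPECTED = (5, 10, 5)  # minus, zero, plus under the null hypothesis

# Obtained counts of studies (minus, zero, plus) from the summary table.
OBTAINED = {
    "standard error": (2, 13, 5),
    "slope": (7, 11, 2),
    "intercept": (1, 10, 9),
}

CRITICAL_5PCT_DF2 = 5.991  # chi-square critical value, df = 2, 5% level


def chi_square(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


for name, counts in OBTAINED.items():
    stat = chi_square(counts, EXPECTED)
    verdict = "significant" if stat > CRITICAL_5PCT_DF2 else "not significant"
    print(f"{name}: chi-square = {stat:.2f} ({verdict} at the 5% level)")
```

Only the intercept distribution exceeds the critical value, and its departure is almost entirely a shift from minus to plus.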

Another way to summarize all of the enclosed studies is simply 
to count up the total number of significant findings and to divide 
by the total number of significance tests run. This has the undesirable 
effect of weighting studies by the number of validity coefficients 
run (number of predictors times number of criteria). Also, since the
tests and the criteria are so highly intercorrelated, there is no 
way to assess the significance of any departure from the 5% expected 
under the null hypothesis. 

Of the 618 tests of significance of differences between standard 
errors, 72 (12%) were significant at the 5% level. Of the 546 tests 
for slopes, 64 (12%) were significant. Of the 482 tests for intercepts,
87 (18%) were significant. While these are somewhat in excess of
chance, it should be pointed out that 30 of the 72 significant standard
error tests were all from study 11, and 39 of the 64 significant
slope tests were all from study 13. Both of these studies were
of the same set of predictors which included multiple scoring 
formulas and multiple time limits for the same tests, thus making 
repeated significance tests on practically the same data. 
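
The pooled percentages are simple ratios and can be reproduced as below. The sketch uses only the counts quoted in the text; the variable names are illustrative. It also checks that the counts are consistent with the stopping rule, under which a slope test is run only after a non-significant standard error test, and an intercept test only after a non-significant slope test.

```python
# (significant findings, tests run) as quoted in the text.
POOLED = {
    "standard error": (72, 618),
    "slope": (64, 546),
    "intercept": (87, 482),
}

percent = {name: 100 * sig / run for name, (sig, run) in POOLED.items()}

for name, value in percent.items():
    print(f"{name}: {value:.1f}% significant")

# Each later test is run only where the earlier one was not significant.
se_sig, se_run = POOLED["standard error"]
slope_sig, slope_run = POOLED["slope"]
_, intercept_run = POOLED["intercept"]
assert se_run - se_sig == slope_run
assert slope_run - slope_sig == intercept_run
```

Because these ratios pool highly intercorrelated tests, they describe the tally but do not support a significance test of their own.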

The problems caused by these spuriously intercorrelated predictors
as well as the problem of weighting findings by total number of
validity coefficients can be offset to a great extent by determining
the median percent of significant findings across the 20 studies.
For the significance of the difference between standard errors, the
median percent significant was zero - certainly less than the 5%
expected by chance. For slopes, the median percent significant was
between 2% and 3%, and for intercepts between 4% and 14%.

While this method of summarizing the results of the studies 
has weaknesses, it is consistent with the summary on page 33. 

Certainly these 20 studies do not tell the whole story. The 
evidence that they do provide is that there is no such thing as 
differential validity but there is a tendency of tests to overestimate 
black job performance. This is exactly what has been found in most 
of the studies in the military. Guinn, Tupes and Alley (8) summarized 
10 studies by stating that: "Assuming that the performance criterion
was unbiased, results indicate that when statistically significant
differences in levels of regression lines were found the performance
of Negroes and high school non-graduates tended to be overestimated."

If we follow the OFCC and EEOC Guidelines, and conduct validation
studies separately for blacks, we are likely to find that
between-group differences in test scores do not correspond to between-group
differences in job performance. If we then follow the Guidelines
and adjust cutoff scores "so as to predict the same probability of
job success in both groups," we will have to raise, not lower, the
passing scores for blacks. Thus, following the OFCC and EEOC
Guidelines will reduce, not increase, the employment opportunities
of blacks.

Summary of Significance Tests of Differences in Regression
Equations Between Whites and Blacks

Study                                                             St. Err.  Slope  Intercept

 1. Kirkpatrick et al. (10) - Study 1 - Clerical - Insurance         0        0        -
 2. Kirkpatrick et al. (10) - Study 2 - Clerical - Insurance         +        -        +
 3. Kirkpatrick et al. (10) - Study 3 - Man. Occ. - Gen. Maint.      0        ?        0
 4. Kirkpatrick et al. (10) - Study 3 - Man. Occ. - Heavy Load       0        ?        0
 5. Farr et al. (3) - Toll Collectors                                0        ?        0
 6. Farr et al. (3) - Correctional Officers                          0        ?        0
 7. Farr et al. (3) - Toll Facility Officers                         +        ?        0
 8. Farr et al. (3) - Home Office Clerical                           0        ?        +
 9. Farr et al. (3) - Keypunch Operators                             +        ?        0
10. O'Leary et al. (13) - Catalog Order Plant - Mat'l Hndlr I        0        ?        0
11. O'Leary et al. (13) - Catalog Order Plant - Mat'l Hndlr II       +        ?        0
12. O'Leary et al. (13) - Catalog Order Plant - Clerical I           +        ?        0
13. O'Leary et al. (13) - Catalog Order Plant - Machine Clerical     0        ?        +
14. O'Leary et al. (13) - Catalog Order Plant - Misc. Clerical       0        ?        0
15. Farr et al. (4) - Health Insurance (Misc. Clerical)              -        ?        +
16. Farr et al. (4) - Health Insurance (Clerk, Clerk-Typist)         0        ?        +
17. Tenopyr (17) - Machine Shop                                      -        ?        +
18. Campbell et al. (2) - Medical Technicians                        0        ?        +
19. Grant and Bray (7) - Telephone Installation and Repair           0        ?        +
20. Gael and Grant (5) - Telephone Service Representatives           0        ?        +

(?) Slope entry not legible in the source copy.


                                                  Expected Under
STANDARD ERROR                                    Null Hypothesis   Obtained

(-) More than chance significant, SE smaller
    for whites                                           5              2
(0) Fewer than chance significant                       10             13
(+) More than chance significant, SE greater
    for whites                                           5              5

    Chi-square = 2.70 (df = 2), .40 < p < .50


                                                  Expected Under
SLOPE                                             Null Hypothesis   Obtained

(-) More than chance significant, slope
    smaller for blacks                                   5              7
(0) Fewer than chance significant                       10             11
(+) More than chance significant, slope
    greater for blacks                                   5              2

    Chi-square = 2.70 (df = 2), .40 < p < .50


                                                  Expected Under
INTERCEPT                                         Null Hypothesis   Obtained

(-) More than chance significant, unfair
    to blacks                                            5              1
(0) Fewer than chance significant                       10             10
(+) More than chance significant, unfair
    to whites                                            5              9

    Chi-square = 6.40 (df = 2), .02 < p < .05